Bilingual Tests with Swedish, Finnish and German Queries

Authors

  • Turid Hedlund
  • Heikki Keskustalo
  • Ari Pirkola
  • Mikko Sepponen
  • Kalervo Järvelin
Abstract

We used a dictionary-based approach and performed tests in the bilingual track with three language pairs: Swedish – English (Swe-Eng), Finnish – English (Fin-Eng), and German – English (Ger-Eng). All the source languages are compound languages, i.e., languages rich in compound words. A compound word is a multi-word expression whose component words are written together. Our main efforts were to develop techniques for the processing of compounds, to study different types of compound languages, and to study the effects of query structuring in different languages. We designed and implemented a method for automated query construction from Finnish, Swedish and German into English (FIN/SWE/GER -> ENG). The goal of this process is to automatically extract topical information from sentences written in one of the source languages (FIN, SWE, GER) and to create a target language (ENG) query. The resulting query may be either structured or unstructured.

Introduction

NLP techniques have been tested for IR and CLIR for several years. The underlying view has been that linguistically motivated indexing would capture the sense of texts and queries in a way that the non-linguistic methods used in IR, for example weighting based on word occurrences, do not. Traditional NLP techniques have also been extended to the sub-word level, i.e., morphological decomposition and stemming (Sparck Jones 1999). So far, no great improvement in retrieval quality due to these techniques has been reported, compared to statistical methods. The language-dependent linguistic features important to IR and CLIR are, for example, the number of homographic word forms, the treatment of compounds, and gender features.
The main problems associated with dictionary-based CLIR are 1) phrase identification and translation, 2) source language ambiguity, 3) translation ambiguity, 4) the coverage of dictionaries, 5) the processing of inflected words, and 6) untranslatable keys, in particular proper names spelled differently in different languages (Pirkola et al. 2000).

Our approach to solving the general problems of bilingual CLIR is based on 1) normalisation in indexing, 2) stopword lists, 3) normalisation of topic words, 4) splitting of compounds, 5) recognition of the right components, 6) phrase composition in the target language, 7) bilingual dictionaries, and 8) structured queries. All the source languages we use (Swedish, Finnish and German) are rich in compounds, so it is essential to develop techniques for the processing of compounds. Our other main interest is to compare structured and unstructured queries as a way of addressing the ambiguity problem in CLIR. We used a model developed and tested for Finnish-English CLIR by Pirkola (1998; Pirkola et al. 1999).
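The steps listed above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the stopword list, the tiny Swedish-English dictionary, the greedy splitting rule and the function names are all hypothetical, topic words are assumed to be already normalised to base forms, and phrase composition in the target language is left out. The `#sum` and `#syn` operators follow InQuery-style query syntax, which is assumed here for illustration only; the text above does not name a query language.

```python
from typing import Dict, List, Set, Tuple

# Toy resources; every entry is a hypothetical illustration.
STOPWORDS: Set[str] = {"och", "i", "av"}
DICTIONARY: Dict[str, List[str]] = {
    "sjukhus": ["hospital"],
    "personal": ["staff", "personnel"],
    "strejk": ["strike", "walkout"],
}


def split_compound(word: str, lexicon: Set[str]) -> List[str]:
    """Split `word` into lexicon components, or return [word] if no split fits."""
    if word in lexicon:
        return [word]
    # Try the longest candidate head first.
    for i in range(len(word) - 1, 1, -1):
        head, tail = word[:i], word[i:]
        if head not in lexicon:
            continue
        # Also try the tail without a linking "s" (common in Swedish and German).
        candidates = [tail]
        if tail.startswith("s"):
            candidates.append(tail[1:])
        for rest in candidates:
            if not rest:
                continue
            parts = split_compound(rest, lexicon)
            if all(p in lexicon for p in parts):
                return [head] + parts
    return [word]


def translate_key(word: str) -> List[List[str]]:
    """Return one group of target-language equivalents per source component."""
    groups = []
    for part in split_compound(word, set(DICTIONARY)):
        # Untranslatable keys (e.g. proper names) pass through unchanged.
        groups.append(DICTIONARY.get(part, [part]))
    return groups


def build_queries(topic: str) -> Tuple[str, str]:
    """Build an unstructured and a structured target-language query."""
    keys = [w for w in topic.lower().split() if w not in STOPWORDS]
    groups = [g for k in keys for g in translate_key(k)]
    # Unstructured: all translation equivalents in one flat list.
    unstructured = "#sum(" + " ".join(t for g in groups for t in g) + ")"
    # Structured: equivalents of the same source key are grouped as synonyms.
    structured = "#sum(" + " ".join(
        g[0] if len(g) == 1 else "#syn(" + " ".join(g) + ")" for g in groups
    ) + ")"
    return unstructured, structured


if __name__ == "__main__":
    for query in build_queries("sjukhuspersonal i strejk"):
        print(query)
```

In the structured form, the alternative translations of one source word are treated as a single synonym group, so a word with many dictionary equivalents does not dominate the query; this is one common way to limit translation ambiguity.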

Similar resources

UTACLIR @ CLEF 2001: New Features for Handling Compound Words and Untranslatable Proper Names

We participated in CLEF’2001 with four automated bilingual runs. UTACLIR is an automatic query translation and construction system for cross-language information retrieval. The system automatically extracts topical information from request sentences written in one of the source languages and constructs a target language query, based on translations given by a translation dictionary. The new fea...

The University of Amsterdam at CLEF 2002

This paper describes the official runs of our team for CLEF 2002. We took part in the monolingual tasks for each of the seven non-English languages for which CLEF provides document collections (Dutch, Finnish, French, German, Italian, Spanish, and Swedish). We also conducted our first experiments for the bilingual task (English to Dutch, and English to German), and took part in the GIRT and Ama...

The University of Amsterdam at CLEF 2003

This paper describes our official runs for CLEF 2003. We took part in the monolingual task (for Dutch, Finnish, French, German, Italian, Russian, Spanish, and Swedish), and in the bilingual task (English to Russian, French to Dutch, German to Italian, Italian to Spanish). We also conducted our first experiments for the multilingual task (both multi-4 and multi-8), and took part in the GIRT task.

The EMIME Bilingual Database

This paper describes the collection of a bilingual database of Finnish/English and German/English data. In addition, the accents of the talkers in the database have been rated. English, German and Finnish listeners assessed the English, German and Finnish talkers’ degree of foreign accent in English. Native English listeners showed higher inter-listener agreement than non-native listeners. Furt...

Highly Relevant Documents Lost in CLIR: Experiments with Dictionary Translation and Pseudo-Relevance Feedback

Research on cross-language information retrieval (CLIR) has typically been restricted to settings using binary relevance assessments. In this paper, we present evaluation results for dictionary-based CLIR using graded relevance assessments in a best match retrieval environment. A text database containing newspaper articles and a related set of 35 search topics were used in the tests. First, mon...

Publication date: 2000